CRM GROUP PROJECT

Water

INTRODUCTION

  • DATASET
  • Marketing Campaign. Boost the profit of a marketing campaign: https://www.kaggle.com/rodsaldanha/arketing-campaign _____
  • CONTEXT
  • This project is an evaluation assignment for the Customer Relationship Management class within the Master's program in Data Mining and Business Intelligence 2023-2024 at Complutense University of Madrid. The whole project is performed by a group of 4 students and this notebook is just a part of the analysis carried out in Python by one team member. This dataset was originally created to predict who will respond to a product or service offer. A response model can provide a significant boost to the efficiency of a marketing campaign by increasing responses or reducing expenditures. In our case, we are going to leverage this dataset to gain hidden insight of the customers. _____
  • OBJECTIVES
    • MAIN OBJECTIVE: Definition what typologies of customers there are in this company and how to get to sell more to each type.

    • SECONDARY OBJECTIVES: Application of the techniques seen in class:

      • Exploratory analysis of the data.
      • Detection and treatment of outliers and missing variables.
      • Creation of new variables that you consider necessary for the analysis.
      • Dimension reduction for visualization and selection of subsequent algorithms.
      • Clustering (segmentation) of customers.
      • Analysis of associations

    The key point is the business interpretation of this exercise, which be taken into consideration while selecting variables and techniques.

    Import essential libraries

    Load the dataset

    Table of Contents

    TASK 1: EXPLORATORY DATA ANALYSIS

    A. Descriptive Statistics

    Using the info function, we can pull the following information from our dataset :

    • We have 5 categorical variables and 24 numerical variables with 2240 observations
    • We have missing values for the Income variable
    • Wrong data type in Income and Dt_Customer

    Statistics review of categorical data

    Statistics review of numerical data

    The describe function generates for us the 5-Number summary, particularly useful as a first step in our preliminary investigation. Analyzing the statistical summary gives us insightful information in one look :

    • Average income is 52247 dollars while median income is 51300 dollars. The distribution is right skewed with the possible presence of outliers
      • The maximum value being equal to 666666 dollars and the 3rd quartile being only equal to 68522 dollars reinforce this hypothesis
    • Average spending in the last 2 years is 600 dollars while median spending is 396 dollars
    • Average age is 45 years old and the oldest customer is 121 years old which is a pretty (and beautiful) rare event
    • Average number of years of education is 14.4 years which corresponds to a Bachelor degree

    Return Table of Contents

    B. Feature Engineering

    Correct data format

    Having a first look at the row data enables us to start thinking at some useful variables we could create in order to better understand our dataset and reveal precious information.

    We wrill create several variables :

    • Variable Age in replacement of the variable Year_birth
    • Variable Marital_Situation to group the different marital status in only 2 comprehensive categories : In couple vs Alone
    • Variable Children as the total number of children
    • Variable Has_child as a binary variable equal to Yes if the customer has 1 child or more
    • Variable Education_level represents 2 groups of Education: Pregraduate, Postgraduate

    Variables for RFM Model

    • Variable Seniority as the number of months the customer is enrolled with the company
    • Variable Frequency as total number of purchases by all channels, divided by Senority
    • Variable Monetary as the total spending of the amount spent on the 6 product categories

    Check redundancy

  • We will remove the unused variables for this analysis
  • There is no column that contains only 1 unique value (unary data)

    Check duplicates

    There 47 duplicated observations in the dataset, which is a small percentage (2.1%). Therefore, in this case, we will not remove the duplicated observations

    Return Table of Contents

    C. Handling outliers

    An outlier is an observation that differs significantly from other values.
    Outliers can be detected using several methods such as statistical methods or graphical methods. We will use the Box-Plot graphical method which enables us to vizualize the range of our data and plot the outliers. By using this technique, we first calculate the Interquartile Range (IQR) defined as follow : $$IQR= Q_3-Q_1$$ where :

    • Any value greater (lower) 1.5 times the IQR above (below) the third quartile (the first quartile) is defined as a mild outlier
    • Any value greater (lower) 3 times the IQR above (below) the third quartile (the first quartile) is defined as an extreme outlier

    Outliers analysis must be done with care. From above we can see that several variables seem to have outliers :

    • Age: 3 customers older than the Upper Fence set at 74 years old. We will remove the customers that has more than 110 years old
    • Income : Several value are greater than the Upper Fence of 113K. While having an income of 150k is not impossible, we will remove the customer who has an income of 600k (Moreover, this observation is defined as an extreme outlier based on our definition stated previously)
    • Spending : There is only one outlier which is at the limit of the Upper Fence. We will not remove it

    Return Table of Contents

    D. Handling missing values

    As seen earlier, the Income variable has 24 missing values

    Data imputation

    Recheck missing values

    The combined dataframe doesn't have any missing value

    Return Table of Contents

    E. Visual Exploratory Data Analysis

    In this subsection, we are going to visualize the data to find significant patterns

    1. Income vs Accepted Rate

    2. Last campaign

    The maximum number of campaigns accepted is 5

    There are only 10 customers who accepted 5 campaigns and all of them accepted the last campaign (Response)

    3. Campaign vs Products

    In general, clients who have positively responded to the campaigns tend to spend more than those who don't. Campaign 1 and 5 have been more sucessful than others, campaign 3 might be related to Gold Product.

    4. Product preference

    Observations:

  • Customers with PhD spend ~60% of their total spending on wines.

  • Customers with Graduation and Master's spend ~45-50% of their total spending on wines.

  • Customers with Graduation and Master's spend ~27-29% of their total spending on meat.

  • Customers with PhD spend ~25% of their total spending on meat.

  • Customers having education level Master or PhD spend ~80% on meat and wines.

  • Customers with basic education spend more on Fruits, Fish, Sweet, and Gold products.

  • Customers from any listed countries spend between 46-55% of their spending on wines, which is the highest percentage compared to customers' spending on other products.

  • Customers from the US spend 48% of their spending on wines, and 3% of their spending on meat.

  • Customers from the AUS spend 75% their spending on wines, and meat.

  • Customers from the SP spend spend ~ 80% of their spending on wines, and meat.

  • Customers from the IND spend spend ~ 5% of their spending on fruits.

  • Customers from the AUS spend spend ~ 9% of their spending on gold where as customers from teh IND spend ~8% of their spending on gold.

  • We can see that spending on the meat category is correlated with most of the other product categories, except for gold products. Due to the culinary practice present in various cultures, a remarkable correlation is observed between the expenditure on meat and wine.

  • High correlation is found between the monetary amount spent on fish and fruit, which can be associated with the current trend in health care.

  • 5. Channels

  • From the above grouped vertical bar chart, we can conclude that those very high-income customers made more than 50% of their purchases using catalog (buying goods to be shipped through the mail), which rank highest and followed by the purchases made directly in stores (2nd). High-income customers also rank high among all income groups in terms of purchases made through the catalog and directly in stores.

  • Low-income customers rank high in the number of purchases made with discounts, whereas low-income customers make the lowest purchases through the catalog.

  • High-income customers rank almost similar in terms of the numbers of purchases made via all 4 different channels, which are at or above 30%

  • Medium income customers made the highest numbers of the purchase via discount i.e., NumDealsPurchases (number of purchases made with discount)

  • High and very high income customers made more than 65% of the 'in stores" purchases.

  • High and very high income customers made more than 60% of the purchases via company's website

  • We can also visualize the correlation by purchases from different channels and income of the customer.

    Observations:

  • As we know that correlation does not imply causation.

  • NumStorePurchase & NumCatalog purchases show a high positive correlation with customer income (0.7). The customers with high income are most likely to make the highest number of purchases via 'in stores' and 'catalog.'

  • NumWebPurchases has a positive correlation of 0.5 with the variable income.

  • NumDealPurchases has a negative correlation (-0.1) with the variable income.

  • NumCatalogPurchase has a positive correlation of 0.4 with the NumWebPurchase.

  • Return Table of Contents

    CUSTOMER SEGMENTATION

    A. RFM MODEL

    RFM 5x5x5

    B. Customer segmentation with K-means clustering method

    B.1) Segmentation with 4 clusters

    Plotting the results

    Cluster summary

    GROUP 0: Potential loyalists (429 clients)

    The first group consists of 429 clients characterized by an average income and significantly low spending but very recent last purchase. On the other hand, people in this group tend to have teenage children, and buy at discounted prices. In addition, they are very frequent visitors to the website, but they shop more in-store. They have been referred to as "Potential Costumers".

    GROUP 1: Champions (284 clients)

    Cluster 1 represent the group of 284 clients which have been purchasing the company's products for a long time with highest spending across all categories and high income level, which is why we will name this group as "Champions". On the other hand, it is the group that actively accepts most of the campaigns and doesn't tend to have children.

    GROUP 2: Loyal customers (453 clients)

    The 453 clients of cluster 2 are referred to as "Loyal Customers", whose average expenditure and income are remarkably high and ttend to have young children. In terms of purchasing behavior, this group that tends to make frequent in-store purchases.

    GROUP 3: Core customers (1070 clients)

    This group consists of 1070 clients and has the lowest level of spending and income. In addition, it is the youngest group of clients with kids, which doesn't tend to accept marketing campaigns. Moreover, this group of consumers makes most of their purchases in-store and, due to the fact that they have the lowest "Seniority" value and take up to 50% of the total number of clients, we will define this group as "Core customers".

    The "Champions" is the only group of customers that doesn't make complaints.

    RADAR PLOT

    "Champions" group has the highest spending in every product categories, followed by "Loyal customers"

    The most actively responsive group of customers to Marketing campaigns is the "Champions", followed by "Loyal customers".

    The mayority of the clients choose in-store purchase methods. On the other hand, catalog channel is most popular among the "Champions" customers.
    Although haven't made a lot of purchases, the "Core customers" tends to visit the store's website more than any other segments. Meanwhile, the "Potential loyalists" are the most responsive to discounts.

    Return Table of Contents

    B.2) Segmentation with 5 clusters

    CLUSTER 0: Need attention (406 clients)

    Cluster 0, referred to as "Needs attention", consists of 406 people that have a relatively low level of spending and income and their last purchase was the most recent. On the other hand, they have many children, usually teenagers. In addition, it is the group that buys the most at discounted prices.

    CLUSTER 1: Loyal (271 clients)

    As for group number 1 or "Loyal Customers", represented in Figure 23 by the bright blue color, it is known to be made up of 271 former customers with the highest spending in all categories and a considerable level of income. On the other hand, they have teenage children, although they do not stand out for having many kids.

    CLUSTER 3: Potential (409 clients)

    Group 3 is represented by 409 people whose average spending and income do not stand out either positively or negatively. However, they can become good customers, as they are the ones with the highest purchase frequency. In addition, they are the customers with the highest average age. They tend to have teenage children and can be considered the most traditional;i.e., they tend to shop in the store. Therefore, they have been named "Potential Loyalists".

    CLUSTER 2: Core (1016 clients)

    This group is composed of a total of 1016 customers and has the lowest level of expenses and income. The total number of clients is 1016 and has the lowest level of expenses and income. In addition, it is the youngest population with the most children, specifically small children, and the one that accepts the fewest marketing campaigns. This group of consumers is also the one that visits the consumers are the most frequent visitors to the website and, due to the fact that they have the lowest "Seniority" value, we choose to classify them as "Core".

    CLUSTER 4: Champions (134 clients)

    The final cluster is made up of a total of 134 clients characterized by a higher level of income and expenses. In addition, people in this group are of average age and do not usually have children. Due to the fact that they have the highest "Seniority" value, they are classified as "Champions".

    After considering segmentations with 4 clusters and 5 clusters, combined with marketing knowledge, we decide to stay with 4 clusters to carry out further analysis and suggest suitable marketing campaign based on the characteristics of each client group.

    Return Table of Contents

    TASK 2: STATISTIC ANALYSIS

    Spending on Gold Products vs Total spending in stores

    We use Chi-square test to check the relation between Spending on Gold Products vs Total spending in stores

    The rest of this task is performed in RStudio using R

    Return Table of Contents

    TASK 3: DATA VISUALIZATION

    Question 1: ¿Qué campaña de marketing tiene más éxito?

    => Campaign 4 is most effective, attracted the most number of clients to join

    It can be seen from the graphs that for the "Loyal Customers", "Core" and "Potential" groups the most responsive clients to the "Response" campaign, meanwhile, among the "Champion" group, the 5th campaign is the most successful, followed by "Response". In general, the last campaign ("Response") has a better response from consumers than the rest of the campaigns carried out by the company.

    Question 2: Describe the company's average customer

    This question can be interpreted by 3 ways:

    • First of all, we can have a deeper look in the group of customers with the average total spending. As previously specified, in the case of 4 clusters segmentation, they are those that belong to the "Loyal Customers" group.
    • Another way to answer the question would be to look at the biggest client cluster, which are the "Core customers" in this context with a total of 1070 consumer.
    • Finally, the average customer of the company could be defined by finding the 29 average values of the main variables to define consumers. Therefore, in this way, the client has the following characteristics: an income of 51958 monetary units, 46 years old and with at least one child. In addition, its average expenditure is 606 monetary units, its total number of purchases is 15 and it has a postgraduate level. As for their buying behavior, they make more purchases through the store and the website and usually do so at least once a month.

    Question 3: Which products are selling better?

    As can be seen in presented figures, the most purchased product, out of all categories, is wine although the amount spent in the "Core" group is significantly lower.

    Question 4: Which channels are underperforming?

    As for the sales channels, the channel with the lowest performance, in general, is through the catalog. In addition, when studying the behavior of the different groups of customers with respect to the method of purchase, it can be seen that the channel with the lowest performance in the groups called "Loyal Customers", "Core" and "Potential" is also by category. However, for the "Champions" group, the channel with the lowest performance would be the website.

    Return Table of Contents

    RECOMMENDATIONS AND CONCLUSION

    Base on the performed analysis, different recommendations are made with the aim of improving the company's results and customers' satisfaction.

    General recomendations




    Recommendations based on customer segmentation

    Knowing the characteristics and attributes of the four groups clients in the previous section, some personalized marketing strategies are recommended through new campaigns aimed at each consumer segment.




    Return Table of Contents